Lightly-Supervised Attribute Extraction

Authors

  • Kedar Bellare
  • Partha Pratim Talukdar
  • Giridhar Kumaran
  • Fernando Pereira
  • Mark Liberman
  • Andrew McCallum
  • Mark Dredze
Abstract

Web search engines can greatly benefit from knowledge about attributes of entities present in search queries. In this paper, we introduce lightly-supervised methods for extracting entity attributes from natural language text. Using these methods, we are able to extract large numbers of attributes of different entities at fairly high precision from a large natural language corpus. We compare our methods against a previously proposed pattern-based relation extractor, showing that the new methods give considerable improvements over that baseline. We also demonstrate that query expansion using extracted attributes improves retrieval performance on underspecified information-seeking queries.

1 Attributes in Web Search

Web search engines receive numerous queries requesting information, often focused on a specific entity, such as a person, place or organization. These queries are sometimes general requests, such as “bio of George Bush,” or specific requests, such as “new york mayor.” Accurately identifying the entity (new york) or related attributes (mayor) can improve search results in several ways [1]. For example, knowledge of attributes and entities can identify a query as being a factual request [1, 2]. Query expansion using known attributes of the entity can also improve results [3]. Additionally, an engine could suggest alternative queries based on attributes. If a user searches for just “Craig Ferguson” and “shows” is a known attribute of the entity “Craig Ferguson”, then an alternative query suggestion could be “Craig Ferguson shows” which may guide the user to more informative results. The widely explored technique of pseudo relevance feedback can also benefit from a known list of entities and attributes [4]. Some view entity and attribute extraction as a primary building block for the automatic creation of large scale knowledge bases aimed at addressing these issues [1].
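The query-suggestion idea in the “Craig Ferguson” example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system described in the paper: the function name `suggest_queries`, the dictionary layout, and the extra attribute “biography” are hypothetical, and the query is assumed to consist of the entity name alone.

```python
def suggest_queries(query, attributes_by_entity, max_suggestions=3):
    """Append known attributes of an entity to an entity-only query.

    `attributes_by_entity` maps a lowercased entity name to a list of
    attributes (a hypothetical layout chosen for this sketch).
    """
    cleaned = query.strip()
    # Look the entity up case-insensitively; unknown entities yield no suggestions.
    attributes = attributes_by_entity.get(cleaned.lower(), [])
    return [f"{cleaned} {attribute}" for attribute in attributes[:max_suggestions]]


# "shows" comes from the example in the text; "biography" is made up for illustration.
known = {"craig ferguson": ["shows", "biography"]}
suggestions = suggest_queries("Craig Ferguson", known)
print(suggestions)
```

A real engine would also need to recognize entities inside longer queries and rank the candidate attributes; both steps are left out here.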
The first step towards improving search results with attributes is to create lists of entities and attributes. Towards that end, we propose new algorithms that, beginning with a small seed set of entities and attributes, learn to extract new entities and attributes from a large corpus of text. We adopt a bootstrapping approach, where the inputs for our learning algorithms are a large unlabeled corpus and the small seed set containing an entity type of interest, such as seed pairs automatically extracted from query logs [1]. The seed pairs are matched against the corpus to create training instances for the learning algorithms. The algorithms exploit a wide range of instance features to alleviate the effects of noise and sparseness. The algorithms produce a large list of entities and associated attributes, which can be directly applied towards improving web search.

This paper proceeds as follows. We begin with some background on attribute extraction and web search applications. We then outline our extraction algorithms. Some examples and evaluations of extracted attributes and entities follow.
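The step in which seed pairs are matched against the corpus to create training instances might look roughly like the following Python sketch. The excerpt does not specify how matching is done or which features are used, so everything here is an assumption: the function `build_training_instances`, the sentence-level co-occurrence criterion, and the two simple features (the words between the two mentions, and which mention comes first) are all hypothetical.

```python
import re


def build_training_instances(sentences, seed_pairs):
    """Collect sentences that mention both members of a seed (entity, attribute) pair.

    Assumption for this sketch: a sentence is a training instance when the
    entity and the attribute both occur in it as whole words.
    """
    instances = []
    for sentence in sentences:
        lowered = sentence.lower()
        for entity, attribute in seed_pairs:
            entity_match = re.search(r"\b" + re.escape(entity.lower()) + r"\b", lowered)
            attribute_match = re.search(r"\b" + re.escape(attribute.lower()) + r"\b", lowered)
            if entity_match is None or attribute_match is None:
                continue
            # Order the two mentions by position to read off the words between them.
            first, second = sorted([entity_match, attribute_match], key=lambda m: m.start())
            instances.append({
                "entity": entity,
                "attribute": attribute,
                "attribute_first": attribute_match.start() < entity_match.start(),
                "between": lowered[first.end():second.start()].split(),
                "sentence": sentence,
            })
    return instances


# Toy corpus and seed pair, invented for illustration.
sentences = [
    "The mayor of New York announced a new budget.",
    "New York is large.",
]
instances = build_training_instances(sentences, [("new york", "mayor")])
print(instances)
```

Only the first sentence yields an instance, because the second mentions the entity without the attribute. Matching this loosely is noisy, which is consistent with the text's remark that the learning algorithms must use a wide range of features to cope with noise and sparseness.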

Similar resources

Selecting Domain-Specific Concepts for Question Generation With Lightly-Supervised Methods

In this paper we propose content selection methods for question generation (QG) which exploit domain knowledge. Traditionally, QG systems apply syntactical transformation on individual sentences to generate open domain questions. We hypothesize that a QG system informed by domain knowledge can ask more important questions. To this end, we propose two lightly-supervised methods to select salient...

Joint Inference over a Lightly Supervised Information Extraction Pipeline: Towards Event Coreference Resolution for Resource-Scarce Languages

We address two key challenges in end-to-end event coreference resolution research: (1) the error propagation problem, where an event coreference resolver has to assume as input the noisy outputs produced by its upstream components in the standard information extraction (IE) pipeline; and (2) the data annotation bottleneck, where manually annotating data for all the components in the IE pipeline...

Improving Lightly Supervised Training for Broadcast Transcriptions

This paper investigates improving lightly supervised acoustic model training for an archive of broadcast data. Standard lightly supervised training uses automatically derived decoding hypotheses using a biased language model. However, as the actual speech can deviate significantly from the original programme scripts that are supplied, the quality of standard lightly supervised hypotheses can be...

Improving lightly supervised training for broadcast transcription

This paper investigates improving lightly supervised acoustic model training for an archive of broadcast data. Standard lightly supervised training uses automatically derived decoding hypotheses using a biased language model. However, as the actual speech can deviate significantly from the original programme scripts that are supplied, the quality of standard lightly supervised hypotheses can be...

A semi-supervised active learning algorithm for information extraction from textual data

In this article we present a semi-supervised active learning algorithm for pattern discovery in information extraction from textual data. The patterns are reduced regular expressions composed of various characteristics of features useful in information extraction. Our major contribution is a semi-supervised learning algorithm that extracts information from a set of examples labeled as relevant ...


Publication date: 2007